Connected component identification and cluster update on GPU 
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Cluster identification tasks occur in a multitude of contexts in physics and engineering such as, 
for instance, cluster algorithms for simulating spin models, percolation simulations, segmentation 
problems in image processing, or network analysis. While it has been shown that graphics processing 
units (CPUs) can result in speedups of two to three orders of magnitude as compared to serial codes 
on CPUs for the case of local and thus naturally parallelized problems such as single-spin flip update 
simulations of spin models, the situation is considerably more complicated for the non-local problem 
of cluster or connected component identification. I discuss the suitability of different approaches 
of parallelization of cluster labeling and cluster update algorithms for calculations on GPU and 
compare to the performance of serial implementations. 

PACS numbers: 05.10.Ln, 75.10.Hk, 05.10.-a 



I. INTRODUCTION 

Due to their manifold applications in statistical and 
condensed matter physics ranging from the description 
of magnetic systems over models for the gas-liquid tran- 
sition to biological problems, classical spin models have 
been very widely studied in the past decades. Since 
exact solutions are only available for a few exceptional 
cases PP, with the steady increase in available computer 
power and the advancement of simulational techniques, 
in many cases computer simulations have become the 
tool of choice even above the more traditional varia- 
tional and perturbative techniques [2J. The workhorse 
of Monte Carlo simulations in the form of the Metropo- 
lis algorithm [3] is extremely general and robust, but 
suffers from problems of slowed down dynamics in the 
vicinity of phase transitions, or for systems with complex 
free-energy landscapes. For the case of continuous phase 
transitions, critical slowing down is observed with auto- 
correlation times increasing as r ~ L z with z 2 in the 
vicinity of the critical point. This divergence of temporal 
correlations is a consequence of the divergent critical cor- 
relations in space, compared to which local modifications 
of the configuration become inefficient. An exceptionally 
successful solution of this problem is given by a class of 
cluster-update algorithms working on stochastically de- 
fined connected regions of spins with identical or similar 
orientation [3HZ], which allow for a significant reduction 
of the dynamical critical exponent z over the local value 
z ss 2 and can thus easily lead to an effective speed gain 
in excess of 10 6 for practically considered system sizes. 
Incidentally, the practical task of cluster identification 
resulting from the probabilistic description of the prob- 
lem as a bond-correlated percolation model is identical 
to that encountered in image segmentation or computer 
vision, where neighboring pixels should be lumped to- 
gether according to their colors, a problem which can be 
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mapped to the Potts model [HI |!|] ■ Also, numerical sim- 
ulations of percolation problems, with their wide range 
of realizations from fluids in porous media to epidemic 
spreading [lOj , must deal with a very similar problem of 
cluster identification (see, e.g., Ref. [2]). Further appli- 
cations occur in network analysis, particle tracking or the 
identification of structures such as droplets in condensed 
matter. Efficient implementations of cluster labeling al- 
gorithms are, therefore, of significant interest for a num- 
ber of different applications in scientific computing. 

In parallel to the invention of new simulation algo- 
rithms, the need for strong computing power for tack- 
ling hard problems has prompted scientists to always 
make the best use of the available computer resources 
of the time, be it regular PCs, vector computers or Be- 
owulf clusters. For the case of simulations of spin models, 
for instance, a number of special purpose computers has 
been devised, including machines for local updates such 
as JANUS for spin glasses [T2] and variants such as the 
"cluster processor" using cluster-update algorithms |13) . 
While these were (and are) highly successful in their spe- 
cific application fields, their design and realization is a 
rather challenging endeavor, costly in terms of monetary 
as well as human resources. It is therefore desirable to 
search for a less application specific, but still highly per- 
formant platform for massively parallel scientific comput- 
ing that is less expensive in terms of its acquisition as well 
as its power consumption and cooling requirements than 
traditional cluster computers. An architecture meeting 
those standards has become available in recent years with 
the advent of general purpose computing on graphics 
processing units (GPUs) [El [15] . With the availability 
of convenient application programming interfaces (APIs) 
for GPU computing, most notably NVIDIA CUDA and 
OpenCL |15j , the programming effort does no longer dra- 
matically exceed that of CPU based parallel machines. 
Still, for efficient implementations architectural peculiar- 
ities of these devices, in particular the organization of 
compute units (cores) in groups (multiprocessors) with 
reduced synchronization capabilities between multipro- 
cessors and the pyramid of memories with latencies, sizes 
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and access scopes decreasing from base to tip, need to be 
taken into account. For the case of spin models, a wide 
range of simulation algorithms with local updates has 
been previously implemented on GPU |16H19| . where for 
the implementations reported in Refs. [HI [2DJ HI] signif- 
icant speedups of two to three orders of magnitude as 
compared to serial CPU codes have been reported. An 
efficient parallelization of non-local algorithms and clus- 
ter labeling is significantly more challenging, however, in 
particular for the case of cluster updates for spin models 
close to criticality, where the relevant clusters undergo 
a percolation transition and are therefore spanning the 
whole system [2"2H2"§] . 

The implementations discussed here have been realized 
within the NVIDIA CUDA [3U] framework with bench- 
marks performed on the GTX 480, GTX 580 and Tesla 
M2070 GPUs. While some of the details are specific to 
this setup, the algorithmic approaches discussed are fairly 
general and could easily applied to other GPU devices 
or realized with different APIs such as OpenCL. For an 
introduction into the details of the GPU hardware and 
the corresponding programming models, the reader is re- 
ferred to the available textbooks (see, e.g, Ref. [TS]) and 
previous articles by the present author [TH [20] • 

The rest of the paper is organized as follows. In Sec- 
tion II GPU implementations of the cluster algorithm 
of Swendsen and Wang [JJ are discussed. The cluster de- 
composition of the complete spin lattice necessary here is 
identical to that of a corresponding image segmentation 
problem or percolation simulation. Section III is devoted 
to the case of the single-cluster variant suggested by Wolff 
[5]. Finally, Section IV contains my conclusions. 

II. SWENDSEN- WANG ALGORITHM 

In this paper, I focus on the ferromagnetic g-state Potts 
model with Hamiltonian 

H = -jJ2^ lSj , (1) 

(ij) 

where Sj G {1, . . . , q} denote the spin variables, J > 
is the exchange coupling, and the sum extends over all 
bonds of an underlying graph, most commonly a regu- 
lar lattice. In dimensions d > 1, the model undergoes a 
transition from a disordered phase at high temperatures 
to an ordered phase where one of the q states prevails at 
low temperatures [31]. For d = 2, the transition is con- 
tinuous for q < 4 and first order for q > 4, while in d = 3 
it is first order for any q > 3. The special case q = 2 is 
equivalent to the celebrated Ising model. A local Monte 
Carlo simulation of the Potts model proceeds by itera- 
tively changing the orientation of randomly chosen spin 
variables in accordance with the detailed balance condi- 
tion [2]. In contrast, the cluster algorithm of Swendsen 
and Wang [4 updates connected components of (usually) 
more than one spin and is based on the following trans- 
formation of the partition function due to Fortuin and 



Kasteleyn [52"], 

Z=]Texp /3J^<5 SiSj (2) 

= EII e " J [( 1 -p)+^] ( 3 ) 

{*<} (ij) 
{"ij} {s;} (ij) 

where /3 denotes the inverse temperature and p = 1 — 
e _/3J . In Eq. a set of auxiliary bond variables 

riij 6 {0, 1} is introduced, where = whenever 
Si =/= Sj and n 13 = 1 with probability p for Si = Sj. 
The resulting stochastically defined clusters are therefore 
subsets of the geometric clusters of parallel spins. Using 
a graphical expansion of the term in square brackets in 
Eq. @ and summing over the spin configurations {sj}, 
it can be shown that the model is equivalent to a gener- 
alized percolation model with partition function [321 133] 

Z = e ,3J p b ({™ij})(i - (5) 

known as the random-cluster model. Here, b({riij}) de- 
notes the number of activated edges resulting from the 
bond variables ny, n({riij} is the number of connected 
components of the induced subgraph, and £ is the total 
number of edges in the underlying graph or lattice. From 
the percolation representation ([5| it is clear [33] that the 
stochastic clusters induced by the bond variables (and 
not the geometric clusters of like spins) undergo a per- 
colation transition at the thermal transition point, and 
hence it is these structures that should be updated to 
efficiently decorrelate the system close to criticality. 

Utilizing the representation Q the algorithm by 
Swendsen and Wang alternatingly updates spins and 
bond variables follows: 

1. For a given spin configuration set = for each 
bond with s$ ^ Sj. Set ny = 1 and = with 
probabilities p and 1— p, respectively, for each bond 
with Si — Sj. 

2. Identify the connected components of the subgraph 
of the lattice induced by the bond variables n^-. 

3. Choose a new spin orientation randomly in 
{1, . . . ,q} for each connected component and up- 
date the spin variables s$ accordingly. 

Since clusters of single spins are possible, this update 
is trivially ergodic. It is straightforward to show that 
detailed balance is fulfilled [U [7] . Hence, the Swendsen- 
Wang (SW) dynamics forms a valid Markov chain Monte 
Carlo algorithm of the Potts model. Autocorrelations 
are dramatically reduced as compared to local spin flips. 
A rigorous bound for the dynamical critical exponent is 
Zint > ol/v |35j . where z- m t is the exponent of the scaling 
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of the integrated autocorrelation time and a and v are the 
(static) critical exponents of the specific heat and the cor- 
relation length, respectively. This bound is close to sharp 
in two dimensions |36) . but not in d — 3 where, never- 
theless, significant reductions in autocorrelations and the 
dynamical critical exponent z are observed. 

Attempting a highly parallel GPU implementation of 
the SW algorithm, it is clear that the bond activation in 
step 1 as well as the cluster flipping in step 3 can be rather 
easily parallelized as they are perfectly local operations. 
In contrast, the cluster identification in step 3 must deal 
with structures spanning the whole system, in particular 
for simulations close to criticality which are the main 
strength of cluster updates. This is also the crucial step 
for further applications of cluster identification such as 
the image segmentation problem mentioned above. The 
total run time for a single update of the spin lattice with 
the SW algorithm on a single GPU therefore decomposes 
as 



J SW — ± activate 



identify 



rpp 
J flip- 



(6) 



We distinguish these times from the corresponding se- 
rial run times Ig W , r activatG , etc. for single-threaded cal- 
culations. For definiteness, the implementation is dis- 
cussed in some detail for the specific example of the Potts 
model on the square lattice of edge length L with peri- 
odic boundary conditions. Generalizations to three di- 
mensions or other lattice types are straightforward. 



A. Bond activation 

We use an array of 2L 2 char variables to represent the 
bond activation states . For the GPU implementation 
using CUDA [T5] , bond activation is performed by a first 
kernel, prepare Jbonds () . Given a configuration of the 
spins Si , for each bond an expression of the form 



1 if Si — Sj and r < p 
otherwise 



(7) 



needs to be evaluated, where r £ (0, 1) is a uniform 
(pseudo) random number, and p = 1 — e 



-PJ 



To en- 



able parallelism, the system is broken up into tiles of 
B 2 = B x x By spins, and each tile is assigned to an 
independent thread block. If we denote £ x = L/B x , 
l y = L/By and £ 2 = £ x £ y the number of tiles, the ex- 
pected parallel run time behaves as 



rpp 



£ 2 B 2 
min(^ 2 , n) min(_B 2 /fc, m) ' 



(8) 



where n denotes the number of multiprocessors (n = 14 
for Tesla M2070, n = 15 for GTX 480, n = 16 for GTX 
580), m is the number of cores per multiprocessor (n = 32 
for all three cards), and k is the number of sites as- 
signed to each thread [57]. For large systems, £ 2 > n and 
B 2 /k > to, Eq. (p§ reduces to Tf ctivate ~ £ 2 B 2 = L 2 . As 



discussed in detail in Refs. [IS1 ED] 7 each thread requires 
its own instance of a random number generator (RNG) 
to prevent the formation of a performance bottleneck. 
Due to the resulting large number of RNG instances (for 
the case of large systems), one requires a generator with 
a small state comprising, ideally, not more than a few 
bytes. This precludes the use of high-quality, but large 
state generators such as Mersenne twister [35] in applica- 
tions of the type considered here. Additionally one needs 
to ensure that the thus created streams of random num- 
bers are sufficiently uncorrelated with each other. Suit- 
able generator types for this purpose are, for instance, ar- 
rays of linear congrucntial generators with random seeds, 
which are fast but might not produce random numbers 
of sufficient quality [THl (T5J 135], generalized lagged Fi- 
bonacci generators jT5J, or the Marsaglia generator as 
suggested in Ref. [19]. As the cluster identification step, 
which does not require random numbers, dominates the 
parallel runtime of the algorithm, RNG speed is not as 
important as in local update simulations on GPUs. For 
the benchmarks reported below, I used an array of 32-bit 
linear congruential generators. Statistically significant 
deviations from the exact results j3U] for the q = 2 Potts 
model at criticality have not been observed. 

An analysis of the kernel with CUDA's Compute Vi- 
sual Profiler [3U] shows that its performance is compute 
bound. Still, memory performance can be improved by 
using an appropriate memory layout ensuring that reads 
of subsequent threads in the spin and bond arrays map 
to consecutive locations in global memory to ensure coa- 
lescence of memory requests |15) . With a linear memory 
arrangement these requirements are best met when us- 
ing tiles with B, x >• B y . Best results for systems with 
L > 256 are found here for B z — 256, B y = 4 (con- 
sidering only lattice sizes L = 2", n £ N). Evaluat- 
ing the acceptance criterion leads to unavoidable thread 
divergence, but the effects are not very dramatic here. 
The asymptotic performance of the kernel with one spin 
per thread, k = 1, is T activatc / L 2 = 0.66 ns on the 
GTX 480 (assuming full loading of the multiprocessors 
which is reached for sufficiently large systems). An alle- 
viating effect on thread divergence and memory limita- 
tions is reached by assigning several spin pairs (bonds) 
to each thread. Two versions have been considered here, 
either assigning a (square) sub-block of four spins to 
each thread (k — 4), which brings down the updating 
time to 7 1 a J ctivatc /L 2 = 0.46 ns, or assigning only B x 
threads per tile, each of which has to update k = B y 
spins, leading to the same asymptotic performance of 

^Tctivate/^ 2 = °- 46 ns on the GTX 480 - The better Per- 
formance of these variant kernels has to do with the pos- 
sibility of pre-fetching of data into registers while arith- 
metic operations are being performed. The same kernel 
is used to also initialize the cluster labels (see below). 
Note that the relatively lower performance of this kernel 
as compared to the Metropolis update of the Ising model 
reported in Ref. [15] of about 0.13 ns per spin (without 
multi-hit updates) on the same hardware is explained by 
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FIG. 1. (Color online) Cluster identification on a 64 x 64 tile 
using a breadth-first search. The already labeled sites are 
indicated in blue (dark squares) while the current wave front 
of unvisited neighbors is shaded in red (light squares). 



the 6-fold increase in memory writes (two chars and one 
int versus one char) and the use of two random numbers 
(instead of one) per spin. 



B. Cluster labeling on tiles 

To allow for an efficient use of parallel hardware, clus- 
ter labeling is first performed on (square) tiles of B x B 
spins and cluster labels are consolidated over the whole 
lattice in a second step (see Sec. II C below) 22 -241 l2"BT 
|2"8] . Hence the time for cluster identification naturally 
breaks up into two contributions, 



Breadth-first search 



In breadth- first search (or "ants in the labyrinth" ) the 
unvisited neighbors of a starting vertex or seed that are 
connected by activated bonds are examined and stored in 
a first-in-first-out (FIFO) data structure (a queue). Sub- 
sequently, nodes are removed from the queue in the order 
they have been stored and examined in the same fashion 
as the initial vertex. This leads to a layered growth of 
the identified part of a cluster as illustrated in Fig. [I] 
The complete set of clusters is being identified by seed- 
ing a new cluster at each node of the lattice that is not 
yet part of a previously identified cluster. Information 
about the cluster structure is stored in an array of clus- 
ter labels, where originally each cluster label is initialized 
with the site number on the lattice and cluster labels are 
set to that of the seed site on growing the cluster, cf. 
Fig. [T] While this approach is very general (it can be 
applied without changes to any graph) and well suited 
for serial calculations, it is not very suitable for paral- 
lelization [42]. Parallelism can be implemented in the 
examination of different neighbors of a site and in pro- 
cessing the sites on the wave front of the growing cluster. 
To avoid race conditions and achieve consistency of the 
data structures, however, locks or atomic operations are 
required, considerably complicating the code. Addition- 
ally, the number of (quasi) independent tasks is highly 
variable as the length of the wave fronts is fluctuating 
quite strongly. For the case of a parallel identification of 
all clusters as necessary for the SW algorithm and image 
segmentation, this approach is hence not very well suited 
for a GPU implementation. A parallel implementation 
will be discussed below in the context of the single-cluster 
(or Wolff) variant of the algorithm in Sec. Ill however. 



TP = T p + T p 

identify local global* 



(9) 



The parallel run time of this kernel, local_BFS(), em- 
ploying one thread per block performing cluster identifi- 
cation in a tile of edge length B, is therefore expected to 
scale as 



In the field of simulations of spin systems (and percola- 
tion), the standard technique for cluster identification is 
due to Hoshen and Kopelman [TT]. Although originally 
not formulated in this context, it turns out to be a spe- 
cial version of the class of union- and- find algorithms well 
known in theoretical computer science |41j . Time and 
storage requirements for this approach scale linearly with 
the number N — L 2 of sites. A somewhat more "natural" 
approach consists of a set of breadth-first searches on the 
graph of bonds, growing the clusters in a layered fashion. 
While storage requirements are super-linear in N (and 
might be as large as ./V 2 depending on the structure of 
the underlying graph), computing time scales still linear 
in N and implementations are typically very straight- 
forward and efficient. A third approach considered here, 
dubbed self-labeling [23] , is very inefficient regarding (se- 
rial) computing time, but very well suited for paralleliza- 
tion. 



I 2 

T -° cal ~ mm(e 2 ,n) B2 - 

The measured run times for i 2 > n follow this expec- 
tation, resulting in perfectly linear scaling of the time 
T p ocal /£ 2 per tile with the number B 2 of tile sites, cf. 
Fig. [2] Since only a maximum of 8 thread blocks can 
be simultaneously active on each multiprocessor on cur- 
rent generation NVIDIA GPUs [15], however, 24 of the 
32 cores of each multiprocessor are idling, leading to 
rather low performance. The asymptotic maximum per- 
formance for large system sizes (leading to an optimum 
effect of latency hiding through the scheduler) on the 
GTX 480 is at around T p oc jL 2 = 13.4 ns for this kernel, 
local_BFS(). 
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FIG. 2. (Color online) Parallel average run time for local 
cluster labeling on a 4096 x 4096 square lattice in tiles of 
edge length B. Data are for the q — 2 states Potts model at 
the critical point. Breadth- first search and tree-based union- 
and-find are (up to logarithmic corrections) proportional to 
the number B 2 of sites, while self-labeling exhibits scaling 
proportional to S 2+t ™ D « _B 3 ' 08 . The weaker scaling propor- 
tional to B dn,iD w B l os of self-labeling for small B is due to 
under-utilization of GPU cores (see main text) . The lines are 
fits of the power law 1f ocal /f = AB K with the indicated fixed 
exponents to the data. 



2. Union-and-find algorithms 

It is a well known problem in computer science to par- 
tition a set of elements into disjoint subsets according to 
some connectedness or identity criterion. A number of 
efficient algorithm for this so-called union-and-find prob- 
lem have been developed [21] • Consider a set of N el- 
ements denoted as vertices in a graph which, initially, 
has no edges. Now, a number of edges is sequentially 
inserted into the graph and the task is to successively 
update a data structure that contains information about 
the connected components resulting from the edge inser- 
tion. Obviously, our cluster identification task is a special 
case of this problem. In a straightforward implementa- 
tion one maintains a forest of spanning trees where each 
node carries a pointer to its parent in the tree, unless 
it is the tree root which points to itself. On insertion 
of an edge one determines the roots of the two adjacent 
vertices by successively walking up the respective tree 
structures {find). If the two roots found are the same, 
the inserted edge was internal and no further action is 
required. If two different roots were found, the edge was 
external and one of the trees is attached to the root of 
the other as a new branch (union), thus amalgamating 
two previously disjoint subsets or connected components 
of the graph. The forest structure can be realized with 
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FIG. 3. (Color online) Cluster labeling using union-and-find 
with balanced trees and (partial) path compression on a 64 x 
64 tile. Under insertion of the edge between sites 30 and 41, 
the smaller cluster with root No. 32 is attached to the root of 
the larger cluster at No. 13. 

an array of node labels, where each node is initialized to 
point to itself (i.e., it is its own root). This process is 
illustrated for the present application in Fig. [3] 

(Worst case) time complexity is trivially constant or 
0(1) for union steps, while find steps can be extensive, 
O(N), if edges connecting macroscopic clusters are con- 
sidered. (Storage requirements are clearly just linear in 
N.) The complexity of the find step can be reduced by 
two tricks, tree balancing and path compression. Bal- 
ancing can be achieved by making sure that always the 
smaller tree (in terms of the number of nodes) is attached 
to the larger. To this end, the current number of nodes 
is stored in the tree root. Balancing reduces the time to 
find the root to O(logiV) steps [41]. Similarly path com- 
pression, which redirects the "up" pointer of each node to 
point directly to the tree root in a backtracking operation 
after each completed find task, reduces find complexity 
to 0(log N). The combination of both techniques can be 
shown to result in an essentially constant find complexity 
for all practically relevant system sizes [23] • A n imple- 
mentation of the full algorithm geared towards cluster 
identification is described in Ref. [44] . 

Like the breadth-first search, the tree-based union-and- 
find approach is intrinsically serial as all operations work 
on the same forest structure, whose consistency could not 
be easily maintained under parallel operations. Moder- 
ate parallelsim is possible in the union step, where the 
two find operations for the vertices connected by the new 
edge can be performed in parallel. Due to the resultant 
thread divergence, however, using two threads per block 
is found to actually decrease performance. Similarly, the 
extra effect of path compression (keeping the stack for 
backtracking in fast shared memory) is found to actu- 
ally increase run times, at least in the range of tile sizes 
4 < B < 64 considered. The parallel scaling of the al- 
gorithm is thus the same (up to logarithmic corrections) 
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FIG. 4. (Color online) Cluster identification on a 64 x 64 tile 
using the self-labeling algorithm with one thread per 2x2 
spins. In every pass, each site examines its northward and 
eastward neighbors and, if they are connected by an active 
bonds, for each pair sets both labels to the minimum of the 
two current labels. 



as that of breadth- first search given in Eq. p0[ ). In fact, 
Ty ocai /i 2 is found to be almost perfectly linear in B 2 in the 
considered regime, cf. the data presented in Fig. [2] The 
asymptotic performance (neglecting logarithmic terms 
due to the find step) of the kernel local_unionf ind() 
on the GTX 480 is found to be T^JL 2 = 8.6 ns, some- 
what better than for local_BFS(). Note that for the 
tree-based algorithms of the union-and-find type, mem- 
ory accesses are inherently non-local leading to a certain 
performance penalty which hardly can be avoided. 



3. Self-labeling 

While breadth-first search and tree-based union-and- 
find are elegant and very efficient for serial implemen- 
tations, they appear not very suitable for parallelization, 
especially on GPUs where groups of threads are executed 
in perfect synchrony or lockstep and extensive thread di- 
vergence is expensive. An antipodal type of algorithm is 
given by the simple approach of self-labeling [23] : Clus- 
ter labels are initialized with the site numbers. Each site 
then compares its label to that of its eastward neigh- 
bor and sets its own and this neighbor's labels to the 
minimum of the two, provided that an activated bond 
connects the two sites. The same is subsequently done 
with respect to the northward neig hbor, cf. Fig.|4| (O n a 
simple cubic lattice, the analog procedure would involve 
three out of six neighbors.) Clearly, the outcome of a 
whole sweep of re-labeling events will depend on the or- 
der of operations and several passes through the tile will 
be necessary until the final cluster labels have propagated 
through the whole system. Eventually, however, no label 
will have changed in a whole pass through the tile and 
the procedure can be stopped, leading to a correct label- 



ing of clusters inside of each tile. Let us first concentrate 
on the spin model at criticality. Then, clusters typically 
span the tile, such that at least of the order of B sweeps 
will be required to pass information about correct cluster 
labels from one end of the tile to the other. In fact, even 
more passes are necessary, as information about cluster 
labels needs to be diffusively transmitted between each 
pair of sites in the same cluster. Since under the cho- 
sen dynamics this information will be transmitted along 
the shortest path connecting the two sites, the required 
number of sweeps will scale as 



n B ~B° 



(11) 



where c? m i n > 1 is the fractal dimension of the shortest 
path on a percolation cluster. For pure percolation (cor- 
responding to the q — > 1 limit of the Potts model) it is 
found to be c? m i n « 1.13 in d = 2 and <i m i n sa 1.34 in 
d = 3 [35J , whereas for the (Fortuin-Kasteleyn clusters of 
the) q — 2 and q — 3 Potts models in two dimensions it 
has been estimated as d m i n = 1.08(1) and d m i n = 1.01(1), 
respectively [46]. Obviously, the approach can be easily 
parallelized inside of tiles, assigning an individual thread 
to one or k > 1 spins. As a consequence, the parallel run 
time for the self-labeling approach is 



I 2 B 2 
T iocai = Clocal min^a ( „j ram{B* /k, m) 



B" 



(12) 



at or close to the percolation transition, which asymptot- 
ically appears to be rather unflattering as compared to 
the breadth-first search and union-and-find techniques. 
Due to the parallelization on the tile level, however, the 
total run time can still be quite low for intermediate tile 
sizes. Off criticality, the scaling becomes somewhat more 
favorable. Below the transition, where clusters span the 
lattice, but they are no longer fractal, d m i n should be re- 
placed by one. Above the transition, on the other hand, 
with a finite correlation length £, B dl " in in Eq. ( 12 ) is 
replaced by min(£,i?). While this somewhat better be- 
havior is probably not very relevant for the spin models 
as simulations close to criticality are the main purpose 
of cluster-update algorithms, it is of importance for per- 
colation simulations or image segmentation problems for 
the (typical) case of a finite characteristic length scale £. 

Figure [2] shows the scaling of parallel run times for the 
kernel local_self label () on tiles of sizes 4 < B < 64 
for the q — 2 Potts model at the critical point j3 c — 
In(l + V2). One can clearly distinguish two regimes with 
scaling Tf oc Jl 2 ~ B d — w B 108 for B 2 /k < m and 
T LJ £2 ~ B 2+d — w B 30S for B 2 /k > m. (The data 
in Fig. [2] are for k = 4 on the GTX 480 with m = 32, 
such that the crossover occurs at B m 11.) As is ap- 
parent from Fig. [2j for tile sizes B < 64 self-labeling is 
clearly superior in parallel performance on GPU as com- 
pared to breadth-first search or union-and-find, although 
it becomes less efficient than the latter two approaches 
for B > 128. I tested different variants: (a) an imple- 
mentation, local_self label_small () , that assigns one 
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spin per thread, restricting the tile size to B < 32 on 
current NVIDIA GPUs with a limitation of 1024 threads 
per block, (b) a kernel local_self label () which assigns 
a 2 x 2 block of spins to each thread, allowing tile sizes up 
to B = 64, and (c) a looped version, local_self label_- 
blockO, that assigns one column of height B to each 
thread, such that the lines are worked on through a loop. 
In all cases, the relevant portion of the bond activation 
variables and cluster labels are copied to shared memory, 
such that memory fetches in the re-labeling steps are (al- 
most) instantaneous. Bank conflicts are avoided through 
an appropriate layout of the data in shared memory. De- 
pending on the number of spins per thread, a different 
order of operations can lead to different results for each 
single self-labeling pass. Consistency could be enforced 
via atomic operations, but these slow down the code and 
are found to be not necessary here. Therefore, while the 
number of necessary self-labeling passes might vary from 
run to run (or device to device) depending on scheduling 
specificities, the final result is deterministic and does not 
depend on the order of operations. The decision about 
the end of self-labeling is taken using the warp vote func- 
tion __syncthreads_or () |30j which evaluates to true as 
long as any of the threads has seen a re-labeling event 
in the last pass. Performance differences between the 
mentioned three kernels arc found to be relatively small. 
The best asymptotic performance is observed for the ker- 
nel local_self label () with 2x2 spins per thread, as 
this setup avoids read/write conflicts in shared memory. 
For tiles of size B = 16 on the GTX 480 the run time 
per spin is T\ oca \/L 2 = 1.08 ns for all labeling passes. 
While the total number of operations is larger for self- 
labeling than for breadth-first search or union-and-find, 
the former is 13 and 8 times faster than the latter at 
B = 16, respectively, due to the easily exploited inherent 
parallelism. 



C. Tile consolidation 

Each of the three cluster labeling algorithms on tiles 
discussed above results in correct cluster labels inside 
of tiles, however, ignoring the information of any active 
bonds crossing tile boundaries. To reach unified labels 
for clusters spanning several tiles, an additional consoli- 
dation phase is necessary. Two alternatives, an iterative 
relaxation procedure and a hierarchical sewing scheme 
have been considered to this end. 



1. Label relaxation 

Cluster labels can be consolidated across tile bound- 
aries using a relaxation procedure similar to the self- 
labeling employed above inside of tiles [28]. In a prepa- 
ration step, for each edge crossing a tile boundary the 
indices of the cluster roots of the two sites connected 
by the boundary-crossing bond are stored in an array, cf. 




FIG. 5. (Color online) Tile consolidation via label relaxation. 
For each spin on the boundary of a tile (squares) with an off- 
tile active bond, the local root nodes (circles) are stored in 
an array. The corresponding local root labels are transmit- 
ted to neighboring tiles, who change their local labels to the 
minimum of their own and the received labels. 



Fig. [5] (kernel prepar e_relax ( ) ) . In the relaxation phase 
each tile sets the root labels of its own active boundary 
sites to the minimum of its own current label and that 
of the corresponding neighboring tile. Relaxation steps 
are repeated until local cluster labels do not change any 
further. Similar to self-labeling, the number of relaxation 
steps scales as the shortest path between two points on 
the largest cluster (s), however, the relevant length scale 
for the relaxation procedure is now I = L/ B, leading to 
the following scaling behavior at the percolation thresh- 
old 
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For systems below the transition temperature or more 
general cluster identification tasks with extensive, but 
non- fractal clusters, c? m i n is replaced by 1, whereas above 
the critical point and for other problems with finite char- 
acteristic length scales n ro i ax ~ (,/B. The number of 
iterations n re i a x is shown for a simulation of the q = 2 
Potts model at criticality in Fig. [6] The expected scal- 
ing with d m ; n = 1.08 [13] is well observed for sufficiently 
large system sizes across all tile sizes B: a fit of the func- 
tional form (13) results in d m i n = 1.0766. The small, 
but visible downward shift of n re i a x with increasing B re- 
sults from concurrency effects: for a small total number 
of tiles many of them are treated at the same time on 
different multiprocessors, resulting in the possibility of a 
label propagating to tiles more than one step away in one 
pass if (as is likely) several of the boundary sites belong 
to the same clusters. 

The number of operations per relaxation iteration is 
proportional to the length of the tile boundary times the 
number of tiles, i.e., 
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FIG. 6. (Color online) Required number n rc i ax of iterations 
for the label relaxation technique for tile consolidation as a 
function of the renor mali zed system size I = (L/B). The line 
is a fit of the form ( 13 1 to the data for I > 100, yielding 
dxnin = 1.0766. 



The relaxation routine (kernel relax ()) appears intrin- 
sically serial in nature as different boundary spins can 
point to the same roots such that concurrent operations 
could lead to inconsistencies, unless appropriate locks are 
used. Nevertheless, an alternative implementation (ker- 
nel relaxjmultithreadO ) using B threads to update a 
number of boundary spin pairs concurrently in a thread 
block is perfectly valid as similar to the self-labeling ap- 
proach only the number of necessary iterations is affected 
by the order of operations while the final result is not 
changed. As different blocks can essentially only be syn- 
chronized between kernel calls, the stopping criterion is 
checked on CPU in between kernel invocations. The par- 
allel run time for this kernel is then given by 
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Note that the asymptotic effort per spin from the relax- 
ation phase, Tg lobal /£ 2 ~ B^ x i d ^ oc L dmin , does not 
become constant as the system size is increased, unless 
the tile size B is scaled proportionally to L. 

For root finding in the spin-flipping phase, it is of some 
relevance that the relaxation process effectively attaches 
all sub-clusters in tiles belonging to the same global clus- 
ter directly to the root of the sub-cluster with the small- 
est cluster label. Therefore, the algorithm involves path 
compression on the level of the coarse grained lattice. 



FIG. 7. (Color online) Hierarchical sewing of 64 tiles for label 
consolidation. On level 1,2x2 tiles are sewn together to form 
16 larger tiles. In levels 2 and 3, the tile numbers are reduced 
to 4 and 1, respectively. 



2. Hierarchical sewing 

An alternative technique of label consolidation across 
tiles uses a hierarchical or divide-and-conquer approach 
as schematically depicted in Fig. [7] 23] . On the first level 
2x2 tiles of B x B spins are sewn together by inserting 
the missing bonds crossing tile boundaries. This results 
in B/2 x B/2 larger tiles which are then combined in a 
second step etc. until, finally, labels of the whole system 
have been consolidated. For the case of periodic bound- 
ary conditions, the bonds wrapping around the lattice in 
both directions need to be inserted in an additional step. 
Bond insertion itself is performed in the union-and-find 
manner described above using tree balancing, i.e., the 
roots of the two clusters connected by the added bond 
are identified and the smaller cluster is then attached 
to the root of the larger cluster. We can assume that 
find times are essentially constant inside of the original 
tiles of size B, either because tile labeling was performed 
with the breadth-first or self-labeling algorithms which 
produce labelings with complete path compression (i.e., 
each node label points directly to the root), or since it 
was done using union-and-find with (at least) one of the 
ingredients of tree balancing or path compression, lead- 
ing to (at most) logarithmic time complexity of finds. 
Then, using tree balancing in the hierarchical sewing step 
ensures that find times remain logarithmically small as 
tiles are combined. Time complexity could be further 
improved by adding path compression, but (as for union- 
and-find inside of tiles) it is found here that this rather 
slows down the code in the range of lattice sizes consid- 
ered here. Note that the self-labeling approach does not 
naturally provide the information about cluster sizes in 
the tree roots. It is found, however, that it has no adverse 
effect on the performance of the tile consolidation step if 
cluster sizes are simply assumed to be identical (and, for 
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exceed n. As the number of remaining tiles is reduced, 
the number of sewing jobs will drop to reach the number 
of multiprocessors at (£2~ k ) 2 =tior 



log 2 



(19) 
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where another approximation is made by allowing for 
non- integer level numbers k. Due to the variable number 
of multiprocessors actually involved in the calculation, 
the total parallel effort has two contributions, 
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label consolidation via hierarchical sewing as a function of 
system size L for different tile sizes B for the q = 2 critical 
Potts model on the GTX 480. The lines are fits of the form 
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global 



a/L + b/B to the data. 



simplicity, equal to one) for partial clusters inside of tiles. 

One thread block is assigned to a configuration of 2 x 2 
tiles at each level. The sewing itself is essentially serial 
in nature. For one of the two linear seams of each sewing 
step (the horizontal seam, say), one can use two threads, 
however, leading to two 2x1 tiles after finishing the 
horizontal seam which are only combined into one larger 
tile by closing the vertical seam [37]. As the tile size for 
the fcth generation is Bk = 2 k B and the length of the 
seam is 2 x 2 k B, the serial computational effort for level 
n of the sewing is 
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where I have neglected logarithmic terms due to the 
find operations. The total number of levels is fe max = 
\og 2 (L/B) (assuming, for simplicity, that L and B are 
powers of two) . Hence, the total serial effort is 
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FIG. 8. (Color online) Total parallel run times TT lobal for Therefore 



the effort T^JL 2 



j per site becomes asymp- 
totically independent of L, but this limit is approached 
rather slowly with a 1/L correction, whereas effects of 
incomplete loading of the device decay as 1/L 2 (in two 
dimensions). This is illustrated by the numerical results 
shown in Fig. [5J The data are well described by the form 
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expected from Eq. (20 1. Comparing Eqs. (20) and (21 1, 
from the ratio a/b of fit parameters one can deduce the 
effective number n of processing units as 



25 + 8a/b + ^25 + 16a /b 



32 



(22) 



and, for instance, the fit at constant B = 16 yields 
n ss 110, while a fit at constant L — 8192 results in 
n ss 113, rather close to the theoretically expected result 
for the GTX 480 with 8 blocks for each of the 15 mul- 
tiprocessors, resulting in 120 processing elements. The 
somewhat smaller n estimated are attributed to effects of 
thread divergence and the neglect of logarithmic terms in 
the find step. For tile size B = 16, the asymptotic perfor- 
mance of this kernel if found to be TZ ohgl / L 2 = 0.78 ns. 
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(17) 

On the GPU device with n multiprocessors mapped to 
independent blocks available for the sewing procedure, 
the parallel run time for generation k is 
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For a sufficiently large system, at the beginning of the 
process the number of tiles (£2~ k ) 2 to sew will always 



D. Cluster nipping 



Having globally identified the connected components 
resulting from the bond configuration, step 3 of the SW 
algorithm described at the beginning of Sec. [TT| consists of 
assigning new, random spin orientations to each cluster 
and adapting the orientation of each spin in the cluster to 
the new orientation prescribed. Since it is inconvenient 
to keep a separate list of global cluster roots, it is easiest 
to generate a new random spin orientation for each lattice 
site while only using this information at the cluster roots. 
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FIG. 9. (Color online) Optimal tile size B opt in cluster iden- 
tification with self-labeling and label relaxation for the q = 2 
states critical Potts model as a function of L. The solid line 



shows the result of Eq. ( 26 1 and the dashed line represents 



the optimum actually observed under the constraints B < 64, 
B = 2 n , n = 1, 2 



FIG. 10. (Color online) Total run times Tg W per spin in 
nanoseconds for the Swendsen-Wang update of the q — 2 
critical Potts model on the GTX 480 from self-labeling on tiles 
plus label consolidation with label relaxation and hierarchical 
sewing, respectively. Lines are guides to the eye. 



To this end, the array of now superfluous bond activation 
variables is re-used. In a first kernel, prepare_f lip() , a 
random orientation is drawn and stored in the bond ar- 
ray for each site. This is done in tiles of B x x B y sites as 
for the bond activation, using the same array of random- 
number generators. In a second step (kernel flipO), 
each site performs a find operation to identify its root and 
applies the new spin orientation found there to the local 
spin. Since cluster labels are effectively stored in a tree 
structure, this step involves non-local memory accesses 
for each site. In principle, locality could be improved 
here by employing full path compression in the union 
steps before, but in practice this is not found to improve 
performance for the system sizes up to 16 384 x 16 384 
considered here. Another possible improvement would 
eliminate the wasteful operation of drawing new proposed 
orientations for all spins while only the new orientations 
of the cluster roots are required. This can be achieved 
by carrying the flipping information piggy-back on the 
cluster labels, at least for the q — 2 or Ising model where 
flipping information is only one bit wide. Again, however, 
in practice it is found that due to the incurred compli- 
cations in the arithmetics regarding cluster labels in find 
and union operations, overall performance is actually de- 
creased by this "optimization" . Due to the necessary tree 
traversal, the performance of the cluster flipping proce- 
dure depends weakly on the degree of path compression 
performed previously in cluster labeling on tiles and label 
consolidation as well as on the tile size B. For the com- 
bination of self-labeling on tiles and hierarchical sewing, 
it is found to be TZ/L 2 = 0.201 ns for L = 8192 and 



B = 16, while it is somewhat smaller at 0.133 ns if label 
relaxation is used instead of hierarchical sewing. 

E. Performance and benchmarks 

As a number of options for the cluster identification 
task have been discussed, the question arises which of 
them is the most efficient for a given set of parameters 
and a given GPU device. For the bond activation and 
cluster flipping steps, the situation is simpler as no im- 
portant variants haven been discussed there, such that 
these steps are always performed with the outlined local 
approaches and tiles with B x = 256 and B y = 4, apart 
from the smallest systems with L < 256. Regarding the 
cluster labeling in tiles, it is clear from the data presented 
in Fig.[2]that self-labeling shows the best performance for 
block sizes B < 128. The main decision is thus between 
the label relaxation and hierarchical sewing approaches 
for label consolidation. Additionally, an optimal tile size 
needs to be determined. For the combination of self- 
labeling and hierarchical sewing, the total parallel run 
time for cluster identification is 
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assuming that B /k > to in Eq. (12 1. Here, a and b 
are the parameters from Eq. (21). On minimizing, the 



optimal tile size is then found to be 
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FIG. 11. (Color online) Break down of parallel run times 
for one SW update per spin into the components of bond 
activation, local labeling, label consolidation via the sewing 
approach and spin flipping. The CPU curve shows reference 
data for a serial implementation running on an Intel Core 2 
Quad Q6700 at 2.66 GHz. 



The fit parameters for the runs on the GTX 480 and 
rfmin = 1-08 then yield -B pt ~ 14.2. Since, for simplicity, 
runs were restricted to L and B being powers of two, 
B = 16 is closest to the optimum. Similar fits for the 
data on the GTX 580 and the Tesla M2070 also used for 
test runs yield the same optimum in the power-of-two 
step sizes. The pre-asymptotic branch with B 2 /k < m in 
Eq. ( 12 ) does not yield an optimum, but total run times 



are monotonously decreasing with B. In other words, as 
long as idle cores in the multiprocessors are available, the 
tile size should be increased. B = 16 hence is also the 
global optimum for this setup. 

For the combination of self-labeling and label relax- 
ation, the total run time for an update is 
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such that the optimal tile size becomes 
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which (with c? m i n ss 1.08) is approximately proportional 
to L 1 / 3 for the critical q — 2 Potts model in two dimen- 
sions. Figure [9] shows the resulting optimal tile size as 
a function of L. Due to the limitation of shared mem- 
ory to 48 kB on current NVIDIA GPUs, self-labeling on 
tiles is limited to block sizes B < 64 (assuming B = 2™, 
n = 1, 2, . ..), such that the optimal tile sizes cannot be 
used for L > 4096. Working directly in global memory 



FIG. 12. (Color online) Run times for the SW update on the 
GTX 480 as a function of the inverse temperature /3 for the 
relaxation and sewing approaches and different tile sizes. 



is no option as it slows down the code dramatically. Us- 
ing breadth-first search or union- and- find on larger tiles 
is feasible, but does significantly increase the total run 
time, even though the relaxation phase is slightly more 
efficient. I therefore did not increase the tile size beyond 
B = 64, as indicated in Fig. [9] 

The resulting total run times on the GTX 480 are 
shown in Fig. |10| The two consolidation approaches lead 
to quite different size dependence. Tile relaxation results 
in a rather fast decay of run times per site in the under- 
utilized regime and is faster than the sewing approach 
for intermediate system sizes. Eventually, however, the 
scaling 
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kicks in, which amounts to 
= 1.08, and results in the 
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[TO] For the hierarchical approach, 
on the other hand, as implied by Eq. (23) the best per- 



formance is reached only rather slowly as L is increased, 
but TP denti{y /L 2 ultimately becomes constant as (theoret- 
ically) L oo. At L — 8192 and /3 = fi c for the q = 2 
Potts model, SW with sewing performs at 3.18 ns per 
spin and per sweep on the GTX 480, while relaxation 
results in 7.15 ns per sweep. For the pure cluster iden- 
tification problem, i.e., without the bond activation and 
spin flipping steps, these times are reduced to 2.52 ns and 
6.56 ns, respectively. Figure [TT| shows the break down 
of run times into the algorithmic components of bond 
activation, labeling on tiles, tile consolidation and spin 
flipping when using hierarchical sewing. Label consoli- 
dation is the dominant contribution up to intermediate 
system sizes, and only for L > 16 384 its run time drops 



12 



T 



T 



30 - 



20 - 



■ GTX 480, relaxation 

■ GTX 580 
-V— GTX 480 

— B — Tesla M2070, ECC on 
-x— Tesla M2070, ECC off 




10 - 



FIG. 13. (Color online) Speed-up of the Swendsen-Wang up- 
date for the q — 2 critical Potts model on GPU as compared 
to CPU (Intel Q6700 at 2.66 GHz) as a function of system 
size L. The circles show results from using the relaxation pro- 
cedure while the remaining data sets are for the hierarchical 
sewing process on the GTX 480, GTX 580 and Tesla M2070 
GPUs, respectively. The latter is shown in variants with and 
without error-correcting code (ECC). 



below that of local labeling on tiles. For smaller sys- 
tems, the fraction of time spent on bond activation and 
spin flipping is negligible, while (due to the decrease in 
time spent for label consolidation) it rises to about 20% 
for L = 8192. As a reference, Fig. [IT] also shows the 
run time of an optimized, serial CPU implementation 
using breadth-first search and on-line flipping of spins as 
the clusters are grown, running on an Intel Core 2 Quad 
Q6700 at 2.66 GHz. 

The incipient percolating clusters for the Potts model 
simulations at f3 c are typical for a critical model. For 
other applications, for instance in image segmentation, 
it is interesting to investigate the performance for more 
general situations. Figure[l2]displays the run times for an 
SW update as a function of inverse temperature /?, com- 
paring the setups with relaxation and hierarchical sewing 
for tile consolidation. There is a natural increase in run 
time with the concentration p = 1 — exp(— /3 J) of bonds. 
While for the sewing procedure, run times increase mono- 
tonically with /3, for the relaxation approach there is a 
pronounced peak of run times near /3 = j3 c , where the 
number of necessary iterations n rc iax shoots up since now 
information about the incipient percolating cluster needs 
to be transmitted across the whole system. Run times 
become somewhat smaller again for (3 > j3 c as most bonds 
crossing tile boundaries belong to the same (percolating) 
cluster such that, due to concurrency, cluster labels can 
travel several steps in one iteration. This concurrency 
effect strongly increases as more tiles are treated simul- 



taneously which, for a fixed number of multiprocessors, 
is the case for larger tile sizes. Figure 12 shows that 
the preference of the sewing procedure over relaxation 
for large systems is robust with respect to variations in 
temperature and should also be justified for more general 
structures not resulting from a percolation transition. 

Figure [T3| shows the speed-up of the GPU implementa- 
tion with respect to the CPU code on the Q6700 proces- 
sor. For large systems, speed-ups in excess of 30 are ob- 
served. Comparing different GPU devices, a clear scaling 
with the number of multiprocessors and global memory 
bandwidth is observed with the best performance seen for 
the GTX 580 (n = 16, 192 GB/s), followed by the GTX 
480 (n = 15, 177 GB/s) and the Tesla M2070 (n = 14, 
144 GB/s). Naturally, effects of higher double-precision 
floating-point performance of the latter are not seen for 
the present code, which almost exclusively uses integer 
and a few single-precision floating point arithmetic in- 
structions. The penalty for activating error correction 
(ECC) on the memory is minute. Some benchmark re- 
sults, also including different processors, are collected in 
Tab. U 



III. WOLFF ALGORITHM 

For simulations of spin models, Wolff [5] suggested a 
variant of the Swendsen-Wang algorithm where only a 
single cluster, seeded at a randomly chosen site, is grown 
at a time which is then always flipped. Empirically, it is 
found that this leads to somewhat smaller autocorrela- 
tion times than SW [JH1 S3 but, most likely, no change 
in the dynamical critical exponent (at least for integer 
q) 50J . Conceptually, one can imagine the single-cluster 
algorithm as a variant of the SW dynamics where after 
a full decomposition of the lattice according to the SW 
prescription, a site is picked at random and the cluster 
of spins it belongs to is flipped. Since the probability of 
picking a specific cluster in this way is proportional to its 
size, in this approach larger clusters are flipped on aver- 
age than in the original SW algorithm. This explains the 
somewhat reduced autocorrelation times. 

While this approach is easily coded in a serial program 
and, in addition to the smaller autocorrelation times, in a 
suitable implementation performs at even somewhat less 
effort per spin than the SW algorithm, it is not straight- 
forwardly parallelized l5"Trf5"3"] . The only obvious par- 
allelism lies in the sites at the wave front of the growing 
cluster, cf. the sketch in Fig. [TJ A number of approaches 
for parallel calculations come to mind: 

(a) A full parallel cluster labeling as in SW, followed by 
picking out and flipping a single cluster. Although 
many operations are wasteful here, there might still 
be a speed-up as compared to the serial code. If us- 
ing a relaxation procedure for label consolidation, 
this approach can be somewhat improved by mod- 
ifying the stopping criterion to only focus on the 
labels belonging to the cluster to be flipped. 
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TABLE I. Benchmark results for the Swendsen-Wang update of the q = 2, q = 3 and q = 4 Potts models on GPU vs. CPU. 
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(b) Restriction to wave- front parallelism [ST]. Due to 
the rather variable number of sites at the front, 
however, this generically leads to poor load balanc- 
ing between the processing units. Load balancing 
can be improved by a de-localization of the wave 
front with a "randomized" rearrangement of the 
lattice. This can be reached, for instance, with a 
scattered strip partitioning, where strips of the lat- 
tice are assigned to available processing units in a 
round-robin fashion, leading to a more even distri- 
bution of sites at the wave front to different pro- 
cessors [52] . 

(c) Suitable modifications of the single-cluster algo- 
rithm to make it more appropriate for parallel com- 
putation. 

The approach (a) can be easily realized with the tech- 
niques outlined in Sec. [TTJ As discussed in Ref. [52], addi- 
tional load balancing can result in significant improve- 
ments on MIMD (multiple instruction, multiple data) 
machines. It appears less suitable for the mixed architec- 
ture of GPUs. In contrast to the more general case of SW 
dynamics discussed above in Sec. [TTJ I refrain here from a 
comprehensive evaluation of options, and only give some 
preliminary results for a modification (c) of the Wolff 
algorithm appearing suitable for GPU computing. 

In this approach, the lattice is again decomposed into 
tiles of edge length B. A single cluster per tile is then 
grown using a number of threads per tile to operate on 
the wave front. Unlike the case of the SW implementa- 
tion, the clusters are not confined to the tiles, but can 
grow to span the whole lattice. One can easily convince 
oneself, that the underlying decomposition remains to be 
the SW one. If seeds in different tiles turn out to belong 
to the same cluster, different parts of that cluster receive 
different labels, but since all clusters are flipped the ef- 
fect is the same as if a single cluster (for that two seeds) 
had been grown (this is for the case of the q = 2 model). 
Logically, this algorithm is identical to performing the 
full SW decomposition and then selecting £ 2 points on 
the lattice, followed by flipping all distinct clusters these 
points belong to. While this approach satisfies detailed 



balance (the SW decomposition remains the same and 
the cluster flipping probability is symmetric), it is not 
ergodic as it stands since, for instance, it becomes im- 
possible to flip only a single spin. This deficiency can be 
easily repaired, however, by assigning a flipping probabil- 
ity pflip < 1 to the clusters which can be large, but must 
be strictly smaller than one. If only a relatively small 
number of tiles is chosen, the decorrelation efficiency of 
this "few-cluster" variant of the SW algorithm is about 
the same as that of the single-cluster variant. 

For implementing the labeling in tiles, a number of 
threads p per block is chosen. If there are enough pend- 
ing sites in the queue, each thread is assigned one of 
these spins which are then examined in parallel. The 
queue is here realized as a simple linear array of size 
N = L 2 . This appears inefficient as the size of the wave 
front will at most be of the order of L d " , where du is the 
fractal dimension of the cluster boundary. In contrast 
to the use of a ring buffer of length oc L df1 , however, 
storing in and retrieving from the queue can be realized 
with atomic operations only [30], i.e., without the use of 
locks. Unfortunately, this setup severely limits the range 
of realizable tile sizes for larger systems as memory re- 
quirements for this queue scale as £ 2 N — L 4 B~ 2 . In 
contrast to the SW algorithm, bond activation and spin 
flipping can be done online with the labeling pass. Con- 
sequently, the "few-cluster" implementation needs only 
two kernels, cluster_tile() for the labeling and flip- 
ping and reset_inclus () for resetting the cluster labels 
after each pass. The number p of threads per block is 
adapted to maximize occupancy of the device. In gen- 
eral, it is found that good results are obtained on setting 

p = min [1024, 1536/ min (max(£ 2 /n, 1), 8)] , (27) 

as 1024 is the maximum number of threads per block, 
1536 is the maximum number of active threads and 8 the 
maximum number of resident blocks per multiprocessor. 
(Here, n denotes the number of multiprocessors of the 
device.) The resulting speed-ups as compared to a se- 
rial code on the Intel Core 2 Quad Q6700 are shown in 
Fig. |14[ The performance for large system sizes is lim- 
ited by the memory consumption of the queues, limiting 
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FIG. 14. (Color online) Speed-up of the "few-cluster" update 
described in Sec. |III| implemented on GPU as compared to a 
single-cluster update on CPU. For each system size, the opti- 
mal tile size B has been selected from the range of allowable 
tile sizes determined by memory constraints. 



the number £ 2 of tiles. Speed-ups by a factor of up to 
about five are achieved, significantly lower than for the 
SW dynamics. It is expected than further optimizations 
(such as the use of ring buffers instead of queues) could 
approximately double this speed-up. Nevertheless, for 
cluster-update simulations on GPUs it might be more 
efficient to stick with the SW algorithm. 



ently non-local in nature, the choice of appropriate algo- 
rithms for implementations on GPU allows for significant 
performance gains as compared to serial codes on CPU. 
The overall speed-up is seen to be lowest for spin mod- 
els at criticality, where clusters are fractal and span the 
system. In all cases, however, speed-ups up to about 30 
can be achieved on current GPU devices. This is to be 
contrasted to the case of purely local algorithms, such as 
Metropolis simulations of spin models, where speed-ups 
are seen to be larger by a factor three to five [HI [2Ql [21] . 
Even with this caveat, it seems clear that GPU com- 
puting is not limited to the case of purely local prob- 
lems as significant performance gains can be achieved for 
highly non-local problems also. Generalizations within 
the realm of spin-model simulations, such as variants on 
different lattices or embedded clusters for O(n) spin mod- 
els [5] are straightforward. 

While the considerations presented here have been re- 
stricted to calculations on a single GPU, it should be 
clear that the approach outlined for the Swendsen-Wang 
dynamics or the pure cluster identification problem is 
easily parallelized across several GPUs. For the case of 
spin-model simulations, the combination of self-labeling 
and label relaxation appears better suited for this task 
as for the final spin-flipping step only information local 
to each GPU is required, whereas for the hierarchical 
scheme cluster roots (and therefore spin-flipping states) 
are scattered throughout the whole system. The most 
effective setup for simulating large systems, therefore ap- 
pears to be the combination of self-labeling and hierarchi- 
cal sewing inside of a GPU and label relaxation between 
GPUs which can easily be realized using MPI with rather 
low communication overheads. 



CONCLUSIONS 

Cluster identification is a pivotal application in scien- 
tific computations with applications in the simulation of 
spin models and percolation, image processing or net- 
work analysis. While the underlying problem is inhcr- 
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